Intro

Tuomas Keski-Kuha 6.11.2021

Here’s a link to my IODS-project

Hi!

Nice to be on the Intro course on Open Data Science. I work in the risk management field in the insurance business. I studied applied mathematics and statistics at the University of Helsinki some 10 years ago. I expect to learn some basics of open data science. I found this course on the MOOC courses page.

Looking forward to learning more about R, and my goal is to study R statistics further so I’m able to use it in my work too, and perhaps also as a hobby! :D So starting with happy feelings, but in a bit of a hurry to finish the first week’s task.


Regression and model validation

Tuomas Keski-Kuha 13.11.2021

Exploring the data

The data is from a study whose intention was to examine different variables affecting students’ grades. It is a questionnaire-based study.

# first let's read the data: 
learning2014 <- read.csv(file = "data/learning2014.csv", 
                      stringsAsFactors = TRUE)

# let's study the data structure etc. 
str(learning2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: int  37 31 25 35 37 38 35 29 38 21 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...
head(learning2014)
##   gender age attitude     deep  stra     surf points
## 1      F  53       37 3.583333 3.375 2.583333     25
## 2      M  55       31 2.916667 2.750 3.166667     12
## 3      F  49       25 3.500000 3.625 2.250000     24
## 4      M  53       35 3.500000 3.125 2.250000     10
## 5      M  49       37 3.666667 3.625 2.833333     22
## 6      F  38       38 4.750000 3.625 2.416667     21

In the data we have 166 observations (students) and 7 variables. Some of the variables describe the students (gender, age, attitude), while deep, stra and surf are the students’ answers in the questionnaire. Points are the exam points and serve as the outcome variable.

Now let’s dig deeper into the data and get an overview of it:

#  the ggplot2 library is already installed, so let's take it into use
library(ggplot2)

#summarise data
summary(learning2014)
##  gender       age           attitude          deep            stra      
##  F:110   Min.   :17.00   Min.   :14.00   Min.   :1.583   Min.   :1.250  
##  M: 56   1st Qu.:21.00   1st Qu.:26.00   1st Qu.:3.333   1st Qu.:2.625  
##          Median :22.00   Median :32.00   Median :3.667   Median :3.188  
##          Mean   :25.51   Mean   :31.43   Mean   :3.680   Mean   :3.121  
##          3rd Qu.:27.00   3rd Qu.:37.00   3rd Qu.:4.083   3rd Qu.:3.625  
##          Max.   :55.00   Max.   :50.00   Max.   :4.917   Max.   :5.000  
##       surf           points     
##  Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.417   1st Qu.:19.00  
##  Median :2.833   Median :23.00  
##  Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :4.333   Max.   :33.00
# let's also use library GGally

library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
# then let's draw a plot where also the possible dependencies can be seen

# create a more advanced plot matrix with ggpairs()
p <- ggpairs(learning2014, mapping = aes(col = gender), lower = list(combo = wrap("facethist", bins = 20)))
p

The summary plot gives a good feel for the data. Gender determines the coloring, so all the information can be compared between the genders. The plot also shows the distributions of all the variables.

Here are a few remarks on the distributions:

  • Age is concentrated around the usual student ages (under 30), but there are still quite a few older students in the study.
  • All variables show significant variability.
  • There seems to be a difference in attitude between the genders.

Here are a few remarks on the correlations:

  • Attitude has the biggest correlation with points.
  • Stra also has a small positive correlation with points, while surf has a small negative correlation with points.
  • All other correlations are reasonably small, but there are some differences between the genders, such as the correlation between surface and deep answers for males.
  • Attitude seems to be related to gender.

Linear model fitting

Next, let’s fit a linear model to the data. The response or dependent variable (the variable you think shows the effect) here is points, and all the others are predictors or independent variables (the variables you think cause the effect).

Let’s choose the three most promising independent variables by looking at the biggest absolute correlations with the target variable (points): these are attitude, stra and surf.
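As a sketch of how this selection could be automated (the helper name `top_cors` is my own, not from the course material; it assumes the learning2014 data frame from above), one can rank the numeric variables by their absolute correlation with points:

```r
# rank the k numeric variables with the largest absolute correlation
# with a target variable (a hypothetical helper, not from the course material)
top_cors <- function(df, target, k = 3) {
  nums <- df[sapply(df, is.numeric)]          # keep numeric columns only
  cors <- cor(nums)[, target]                 # correlations with the target
  cors <- cors[names(cors) != target]         # drop the target itself
  head(sort(abs(cors), decreasing = TRUE), k) # k largest in absolute value
}

# with the data above this should return attitude, stra and surf:
# top_cors(learning2014, "points")
```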

 # Let's create a linear regression model with multiple explanatory variables
my_model <- lm(points ~ attitude + stra + surf, data = learning2014)
summary(my_model)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.01711    3.68375   2.991  0.00322 ** 
## attitude     0.33952    0.05741   5.913 1.93e-08 ***
## stra         0.85313    0.54159   1.575  0.11716    
## surf        -0.58607    0.80138  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

We fitted the linear model y = a0 + a1x1 + a2x2 + a3x3. Looking at the results:

  • the constant (a0) has a fairly large estimate, but its standard error is also quite large. Judging by the corresponding p-value (about 0.3 %), the null hypothesis (a0 = 0) looks unlikely
  • a1, the coefficient of the attitude variable, is statistically very significant, so the null hypothesis (a1 = 0) is very unlikely (the p-value is really small)
  • the other parameters a2 and a3 do not have significant p-values, so it seems reasonable not to keep x2 and x3 in the model (the null hypothesis a2 = a3 = 0 cannot be rejected)

Of course the model will fit differently if we drop only one of the non-significant variables, but here it seems that a small a2 or a3 would not change the model that much, and those extra parameters could even worsen the model’s prediction capability. Let’s fit a linear model with two variables (attitude and stra):

 # Let's create a linear regression model with two explanatory variables
my_model_2 <- lm(points ~ attitude + stra, data = learning2014) 
summary(my_model_2)
## 
## Call:
## lm(formula = points ~ attitude + stra, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.6436  -3.3113   0.5575   3.7928  10.9295 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.97290    2.39591   3.745  0.00025 ***
## attitude     0.34658    0.05652   6.132 6.31e-09 ***
## stra         0.91365    0.53447   1.709  0.08927 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared:  0.2048, Adjusted R-squared:  0.1951 
## F-statistic: 20.99 on 2 and 163 DF,  p-value: 7.734e-09

Dropping the surf variable did not change things much, but at least a0 got a better fit. Still, the overall results aren’t that impressive without surf if you look at the residual standard error and the residual values. So it seems that the best thing here would be to drop stra as well (the simpler, the better). The residuals look to have the same kind of distribution as in the previous model.

The multiple R-squared value (ranging from 0 to 1) gives the proportion of the variance in the response variable (the effect) that can be explained by the predictor variables (the cause). A low value indicates that the variance in the response variable (points) is not explained well by the predictors (attitude and stra). And indeed, the R-squared value here is not impressive; there is a lot of variability that the predictors do not explain.

Also, the adjusted R-squared did not change much between the first and the second model; its value is about 0.2. Adjusted R-squared is handier than multiple R-squared because it takes the number of variables in the model into account and adjusts the multiple R-squared accordingly (R-squared will increase as you add more variables). So with adjusted R-squared one can really compare models in terms of the variance explained in the response variable.
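As a quick sanity check (using the rounded figures from the summary above), the adjusted value follows from the multiple R-squared with the usual formula:

```r
# adjusted R-squared: 1 - (1 - R2) * (n - 1) / (n - p - 1),
# where n = number of observations and p = number of predictors
r2 <- 0.2048  # multiple R-squared of the two-variable model
n  <- 166     # observations
p  <- 2       # predictors (attitude, stra)
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
round(adj_r2, 4)  # ~0.195, close to the 0.1951 reported in the summary
```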

For the fun of it, let’s also fit a simple line (y = a0 + a1x1), where x1 is attitude.

 # Let's create a linear regression model with just one explanatory variable
my_model_3 <- lm(points ~ attitude, data = learning2014) 
summary(my_model_3)
## 
## Call:
## lm(formula = points ~ attitude, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.63715    1.83035   6.358 1.95e-09 ***
## attitude     0.35255    0.05674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

We observe that the residual standard error became a bit larger even though the fitting of the parameters was more successful. The adjusted R-squared also became lower, which is not good, but is the difference significant? I would guess not, if you look at the residual standard errors and compare them with the previous models with multiple predictive variables.

Graphical model validation

Let’s first draw a simple regression line on the data, with attitude as the predictive variable:

# initialize plot with data and aesthetic mapping
p1 <- ggplot(learning2014, aes(x = attitude, y = points))
# define the visualization type (points)
p2 <- p1 + geom_point()
# add a regression line
p3 <- p2 + geom_smooth(method = "lm")
# draw the plot
p3
## `geom_smooth()` using formula 'y ~ x'

Let’s draw diagnostic plots, choosing the following:

  • Residuals vs Fitted values
  • Normal QQ-plot
  • Residuals vs Leverage
# draw diagnostic plots using the plot() function. Choose the plots 1, 2 and 5

plot(my_model_3, which = c(1, 2, 5))

Comments on the plots:

  • Residuals vs Fitted values: In this plot linearity looks like a valid assumption. There is still a lot of variability, and perhaps with more observations the linearity assumption would look even stronger… or not. There are also some outliers (145, 56, 35) which may affect the linearity assumption and the fit. There is no clear pattern in this plot, which indicates that the errors have constant variance and that the size of the errors does not depend on the predictive variable.
  • Normal QQ-plot: The normality assumption for the errors looks quite valid here. The points follow the line nicely, which indicates that the errors come from a normal distribution. There are some outliers which do not fit so nicely, but extreme values are rare, as they should be. Just by looking at the picture, the normality assumption is not far-fetched.
  • Residuals vs. Leverage: This plot shows the influence of single observations on the model, i.e. whether any outliers have an unusually high impact. It looks like no outlier is significant for the model: the points with the highest leverage do not fall far from the average standardized residual, and the outliers in the data are close to the average attitude, so they do not have much influence.

Logistic regression

Tuomas Keski-Kuha 21.11.2021

Data exploration and finding relevant variables

# access a few libraries: 
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
library(ggplot2)

# first let's read the data: 
alc<- read.csv(file = "https://github.com/rsund/IODS-project/raw/master/data/alc.csv", 
                      stringsAsFactors = FALSE, sep = ",")

# let's study the data structure etc. 
glimpse(alc)
## Rows: 370
## Columns: 51
## $ school     <chr> "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP",~
## $ sex        <chr> "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F",~
## $ age        <int> 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15,~
## $ address    <chr> "R", "R", "R", "R", "R", "R", "R", "R", "R", "U", "U", "U",~
## $ famsize    <chr> "GT3", "GT3", "GT3", "GT3", "GT3", "GT3", "GT3", "LE3", "LE~
## $ Pstatus    <chr> "T", "T", "T", "T", "T", "T", "T", "T", "T", "A", "A", "T",~
## $ Medu       <int> 1, 1, 2, 2, 3, 3, 3, 2, 3, 3, 4, 1, 1, 1, 1, 1, 2, 2, 2, 3,~
## $ Fedu       <int> 1, 1, 2, 4, 3, 4, 4, 2, 1, 3, 3, 1, 1, 1, 2, 2, 1, 2, 3, 2,~
## $ Mjob       <chr> "at_home", "other", "at_home", "services", "services", "ser~
## $ Fjob       <chr> "other", "other", "other", "health", "services", "health", ~
## $ reason     <chr> "home", "reputation", "reputation", "course", "reputation",~
## $ guardian   <chr> "mother", "mother", "mother", "mother", "other", "mother", ~
## $ traveltime <int> 2, 1, 1, 1, 2, 1, 2, 2, 2, 1, 1, 3, 1, 1, 1, 1, 3, 1, 2, 1,~
## $ studytime  <int> 4, 2, 1, 3, 3, 3, 3, 2, 4, 4, 2, 1, 2, 2, 2, 2, 3, 4, 1, 2,~
## $ schoolsup  <chr> "yes", "yes", "yes", "yes", "no", "yes", "no", "yes", "no",~
## $ famsup     <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "ye~
## $ activities <chr> "yes", "no", "yes", "yes", "yes", "yes", "no", "no", "no", ~
## $ nursery    <chr> "yes", "no", "yes", "yes", "yes", "yes", "yes", "yes", "no"~
## $ higher     <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "ye~
## $ internet   <chr> "yes", "yes", "no", "yes", "yes", "yes", "yes", "yes", "yes~
## $ romantic   <chr> "no", "yes", "no", "no", "yes", "no", "yes", "no", "no", "n~
## $ famrel     <int> 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 3, 5, 5, 3, 3,~
## $ freetime   <int> 1, 3, 3, 3, 2, 3, 2, 1, 4, 3, 3, 3, 3, 4, 3, 2, 2, 1, 5, 3,~
## $ goout      <int> 2, 4, 1, 2, 1, 2, 2, 3, 2, 3, 2, 3, 2, 2, 2, 3, 2, 2, 1, 2,~
## $ Dalc       <int> 1, 2, 1, 1, 2, 1, 2, 1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1,~
## $ Walc       <int> 1, 4, 1, 1, 3, 1, 2, 3, 3, 1, 1, 2, 3, 2, 1, 2, 1, 1, 1, 1,~
## $ health     <int> 1, 5, 2, 5, 3, 5, 5, 4, 3, 4, 1, 4, 4, 5, 5, 1, 4, 3, 5, 3,~
## $ n          <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,~
## $ id.p       <int> 1096, 1073, 1040, 1025, 1166, 1039, 1131, 1069, 1070, 1106,~
## $ id.m       <int> 2096, 2073, 2040, 2025, 2153, 2039, 2131, 2069, 2070, 2106,~
## $ failures   <int> 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2,~
## $ paid       <chr> "yes", "no", "no", "no", "yes", "no", "no", "no", "no", "no~
## $ absences   <int> 3, 2, 8, 2, 5, 2, 0, 1, 9, 10, 0, 3, 2, 0, 4, 1, 2, 6, 2, 1~
## $ G1         <int> 10, 10, 14, 10, 12, 12, 11, 10, 16, 10, 14, 10, 11, 10, 12,~
## $ G2         <int> 12, 8, 13, 10, 12, 12, 6, 10, 16, 10, 14, 6, 11, 12, 12, 14~
## $ G3         <int> 12, 8, 12, 9, 12, 12, 6, 10, 16, 10, 15, 6, 11, 12, 12, 14,~
## $ failures.p <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,~
## $ paid.p     <chr> "yes", "no", "no", "no", "yes", "no", "no", "no", "no", "no~
## $ absences.p <int> 4, 2, 8, 2, 2, 2, 0, 0, 6, 10, 0, 6, 2, 0, 6, 0, 0, 4, 4, 2~
## $ G1.p       <int> 13, 13, 14, 10, 13, 11, 10, 11, 15, 10, 15, 11, 13, 12, 13,~
## $ G2.p       <int> 13, 11, 13, 11, 13, 12, 11, 10, 15, 10, 14, 12, 12, 12, 12,~
## $ G3.p       <int> 13, 11, 12, 10, 13, 12, 12, 11, 15, 10, 15, 13, 12, 12, 13,~
## $ failures.m <int> 1, 2, 0, 0, 2, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3,~
## $ paid.m     <chr> "yes", "no", "yes", "yes", "yes", "yes", "no", "yes", "no",~
## $ absences.m <int> 2, 2, 8, 2, 8, 2, 0, 2, 12, 10, 0, 0, 2, 0, 2, 2, 4, 8, 0, ~
## $ G1.m       <int> 7, 8, 14, 10, 10, 12, 12, 8, 16, 10, 14, 8, 9, 8, 10, 16, 1~
## $ G2.m       <int> 10, 6, 13, 9, 10, 12, 0, 9, 16, 11, 15, 0, 10, 11, 11, 15, ~
## $ G3.m       <int> 10, 5, 13, 8, 10, 11, 0, 8, 16, 11, 15, 0, 10, 11, 11, 15, ~
## $ alc_use    <dbl> 1.0, 3.0, 1.0, 1.0, 2.5, 1.0, 2.0, 2.0, 2.5, 1.0, 1.0, 1.5,~
## $ high_use   <lgl> FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE,~
## $ cid        <int> 3001, 3002, 3003, 3004, 3005, 3006, 3007, 3008, 3009, 3010,~
colnames(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "schoolsup" 
## [16] "famsup"     "activities" "nursery"    "higher"     "internet"  
## [21] "romantic"   "famrel"     "freetime"   "goout"      "Dalc"      
## [26] "Walc"       "health"     "n"          "id.p"       "id.m"      
## [31] "failures"   "paid"       "absences"   "G1"         "G2"        
## [36] "G3"         "failures.p" "paid.p"     "absences.p" "G1.p"      
## [41] "G2.p"       "G3.p"       "failures.m" "paid.m"     "absences.m"
## [46] "G1.m"       "G2.m"       "G3.m"       "alc_use"    "high_use"  
## [51] "cid"

Data set information (straight from the web page):

This data approach student achievement in secondary education of two Portuguese schools. The data attributes include student grades, demographic, social and school related features) and it was collected by using school reports and questionnaires. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd period grades. It is more difficult to predict G3 without G2 and G1, but such prediction is much more useful (see paper source for more details).

Still, we are going to use the data to predict a binary variable for alcohol consumption (high_use).

The 4 variables I expect to affect alcohol consumption are as follows (the variable picking was biased, as I already saw the results in the DataCamp exercise :D). Anyway, let’s pick:

  • sex: it is quite typical that males drink more
  • absences: one could think that absences go up for heavy alcohol users due to hangovers and other side effects
  • failures: same reasoning as for absences
  • goout: the thinking here is quite obvious, someone who goes out more is perhaps also drinking more

Let’s draw some plots and tables to see whether there might be a relationship between the target variable (high_use) and the predictors.

# produce summary statistics by group for studying relationship between high_use and sex
alc %>% group_by(sex, high_use) %>% summarise(count = n())
## `summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.
## # A tibble: 4 x 3
## # Groups:   sex [2]
##   sex   high_use count
##   <chr> <lgl>    <int>
## 1 F     FALSE      154
## 2 F     TRUE        41
## 3 M     FALSE      105
## 4 M     TRUE        70
# let's draw a plot to study how absences might relate to the target variable
g1 <- ggplot(alc, aes(x = high_use, y = absences, col = sex))

g1 + geom_boxplot() + ggtitle("Student absences by alcohol consumption and sex")

  • sex: It looks like there is a big difference in high alcohol use between the sexes. Males tend to use much more, as expected.
  • absences: In the plot we drew absences by sex, and we can see that high users have a greater tendency to be absent from classes. Interestingly, for females the difference is not as pronounced as for males.
# initialize a plot of high_use related to gout
g2 <- ggplot(data = alc, aes(x = goout, fill = high_use))

# define the plot as a bar plot and draw it
g2 + geom_bar() + ggtitle("Student go-out-tendency by high alcohol consumption")

#  let's count few crosstables for examining high_use relatedness to failures
table(high_use = alc$high_use, failures = alc$failures)
##         failures
## high_use   0   1   2   3
##    FALSE 238  12   8   1
##    TRUE   87  12   9   3
  • goout: The bar chart shows that students who go out more also use more alcohol. The effect is quite clear, and this was quite expected.
  • failures: There is a visible tendency for high users to have failed classes in the past, as expected, but this variable is tricky because 1-3 failed classes have so few instances. That could disturb modeling with this feature.

Logistic regression model

In this chapter, let’s apply a logistic regression model to predict high alcohol use. For the model we use the variables which seemed to have an effect on high alcohol use (sex, goout, absences, failures).

# find the model with glm() (target high_use)
m <- glm(high_use ~ sex + goout + absences + failures, data = alc, family = "binomial")

summary(m)
## 
## Call:
## glm(formula = high_use ~ sex + goout + absences + failures, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1521  -0.7929  -0.5317   0.7412   2.3706  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -4.14389    0.47881  -8.654  < 2e-16 ***
## sexM         0.97989    0.26156   3.746 0.000179 ***
## goout        0.69801    0.12093   5.772 7.83e-09 ***
## absences     0.08246    0.02266   3.639 0.000274 ***
## failures     0.48932    0.23073   2.121 0.033938 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 369.50  on 365  degrees of freedom
## AIC: 379.5
## 
## Number of Fisher Scoring iterations: 4

The model summary seems to say that all of the predictive variables fit well and are statistically significant (all the others have p-values < 0.001, but for failures the p-value is 0.03, which is less significant, so it is reasonable to assume it won’t add much prediction power). Also, we cannot rule out strong correlations between the predictors, so it could be wise to study those as well.

Let’s also present the coefficients as odds ratios, together with confidence intervals:

# compute odds ratios (OR) round the results
OR <- coef(m) %>% exp %>% round(2)

# compute confidence intervals (CI), round the results
CI <- confint(m) %>% exp %>% round(2)
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
##               OR 2.5 % 97.5 %
## (Intercept) 0.02  0.01   0.04
## sexM        2.66  1.61   4.49
## goout       2.01  1.59   2.56
## absences    1.09  1.04   1.14
## failures    1.63  1.04   2.59

I did not have time to dig into the intercept term, but luckily its odds ratio is near zero, so it won’t matter much in this case. The sex and goout odds ratios are clearly bigger than one, which indicates that they are relevant for the probability of high alcohol use. Failures and absences also have odds ratios above one, but only slightly, so they do not seem as relevant. Their confidence intervals also suggest that they are less significant.


Predictions

Let’s use the logistic regression model to predict high alcohol use from the selected data. We’ll calculate the probability of high alcohol use and then form a new variable which is TRUE if the probability is bigger than 0.5 and FALSE otherwise. After that we make a cross table of the model’s predictions against the actual high_use variable (the results of the study), plus a graphical presentation of predictions vs. results.

# predict() the probability of high_use
probabilities <- predict(m, type = "response")

# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)

# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)

# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   243   16
##    TRUE     61   50
# initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))

# define the geom as points and draw the plot
g + geom_point()

The model gives sensible results: we can see 61 + 16 wrong predictions vs. 243 + 50 correct ones.

Let’s also define a loss function and use it to calculate the inaccuracy of the model. The same figure can be computed from the cross table above (incorrect cases divided by all cases).

# define a loss function (average prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# call loss_func to compute the average number of wrong predictions in the (training) data
round(loss_func(class = alc$high_use, prob = alc$probability), 4)
## [1] 0.2081

So the model is correct on average in 79 % of the cases and incorrect in 21 % of the cases.

It seems that high use is hard to predict (the model gets it wrong 61 times vs. 50 correct), but for low use it is far more precise (16 wrong vs. 243 correct). A simple coin-flip model would be correct in approx. 50 % of the cases (and always guessing FALSE would be correct in about 70 %), so the model gives reasonable accuracy.
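The figures above can be verified straight from the cross table; a small sketch using the printed counts:

```r
# confusion matrix counts from the cross table above
tn <- 243; fp <- 16   # actual FALSE: predicted FALSE / TRUE
fn <- 61;  tp <- 50   # actual TRUE:  predicted FALSE / TRUE
total <- tn + fp + fn + tp

accuracy <- (tn + tp) / total  # share of correct predictions
error    <- (fn + fp) / total  # matches the loss-function result
round(c(accuracy = accuracy, error = error), 4)  # accuracy 0.7919, error 0.2081
```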


Cross-validation

Let’s do a 10-fold cross-validation to validate the logistic regression model used above:

  • We divide the data into 10 equal-size segments (data groups 1, 2, 3, …, 10).
  • Then we fit the model on data groups 2 to 10 (training set) and measure the prediction error on group 1 (test set).
  • After that we train on groups 1, 3, 4, …, 10 and measure the prediction error on group 2.
  • We continue until all the data has been used this way; the last prediction error is calculated for group 10 after fitting on groups 1 to 9.
  • Then we average all these prediction errors to get the validation figure.
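The steps above are essentially what cv.glm() does internally; as an illustration only, here is a manual sketch (the helper name kfold_error is my own, and a logistic model with a logical target is assumed, as in the analysis above):

```r
# manual K-fold cross-validation for a logistic regression formula
# (a sketch of what cv.glm() does; the helper name is hypothetical)
kfold_error <- function(data, formula, K = 10, threshold = 0.5) {
  n <- nrow(data)
  folds <- sample(rep(1:K, length.out = n))  # random fold assignment
  errs <- numeric(K)
  target <- all.vars(formula)[1]             # name of the response variable
  for (k in 1:K) {
    train <- data[folds != k, ]
    test  <- data[folds == k, ]
    fit   <- glm(formula, data = train, family = "binomial")
    prob  <- predict(fit, newdata = test, type = "response")
    pred  <- prob > threshold
    errs[k] <- mean(pred != test[[target]])  # fold-wise error rate
  }
  mean(errs)  # average prediction error over the K folds
}

# e.g. kfold_error(alc, high_use ~ sex + goout + absences + failures)
```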
# set the random seed, so we can reproduce exactly the same results
set.seed(123)
# K-fold cross-validation
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)

# average number of wrong predictions in the cross-validation; this is the raw
# inaccuracy rate, not an adjusted value
round(cv$delta[1], 4)
## [1] 0.2081

So cross-validation gives a figure really close to the prediction error we got when all the data was used for training.


Cross-validating different models

# Let's define the predictor sets we want (just using the column numbers):

# the first test set has many predictors (almost all of them, more than 40)
testi_1 <- c(1:24, 27, 29:30,  31, 33:48, 50)
# we drop a few in the testi_2 predictor set, but it still has quite many (approx. 30)
testi_2 <- c(2:24, 29:31, 33:38, 50)
# we drop some more, still keeping the most relevant (approx. 20)
testi_3 <- c(2:11, 12, 17, 24, 29:30,  31, 33:35, 50)
# now we keep just 10
testi_4 <- c(2:5, 24, 27, 30,  31, 33, 50)
# the last predictor set has only 3; number 50 is high_use (the target variable)
testi_5 <- c(2, 24, 33, 50)


# build a list of the different predictor sets
testi_setti <- list(testi_1, testi_2, testi_3, 
                    testi_4, testi_5)

# initialising the result matrix
tulokset <- matrix(nrow = 2, ncol = 5)

# define a loop which takes one testi_setti item (predictor vector) at a time and
# calculates the training error on the whole data set as well as the test error
# from 10-fold cross-validation (the code inside is copied from above)

for (i in 1:5) {
  
pred_var <- testi_setti[[i]]

# find the model with glm() (target high_use)
m2 <- glm(high_use~., data = alc[pred_var], family = "binomial")

# predict() the probability of high_use
probabilities <- predict(m2, type = "response")

# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)

# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)

# call loss_func to compute the average number of wrong predictions in the (training) data
train_err <- round(loss_func(class = alc$high_use, prob = alc$probability), 4)

set.seed(123)

# cross-validation
cv <- cv.glm(data = alc[pred_var], cost = loss_func, glmfit = m2, K = 10)

# here we'll have results for cross-validation
test_err <- round(cv$delta[1], 4)

# saving the results to tulokset matrix
tulokset[1,i] <- train_err
tulokset[2,i] <- test_err

}

tulokset
##        [,1]   [,2]   [,3]   [,4]   [,5]
## [1,] 0.1757 0.1946 0.2000 0.2162 0.2108
## [2,] 0.2649 0.2757 0.2432 0.2243 0.2081

In the tulokset matrix the first row contains the training errors on the whole data and the second row the test errors from cross-validation. The column 1 model has almost all the predictors one can pick, and we decrease the number of predictors (keeping the most relevant ones) until the fifth column, which has only three.

It can be noticed that the training error tends to go up when there are fewer predictors in the model. With more predictors the model is more complex and learns the training data better, but on the other hand the test error is clearly larger. So a model with many predictors becomes dependent on the data it was trained on and does not work as well on the cross-validation sets. One could say that a good model with significant predictors is simpler, far more robust and passes the testing better.


Clustering and classification

Tuomas Keski-Kuha 28.11.2021


Data exploration and finding correlations


Exercise 2: Load the Boston data from the MASS package. Explore the structure and the dimensions of the data and describe the dataset briefly, assuming the reader has no previous knowledge of it. Details about the Boston dataset can be seen for example here.


Let’s see the Boston data in the following:

# access a few libraries: 
library(dplyr)
library(tidyr)
library(ggplot2)
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.1.2
## corrplot 0.92 loaded
library(GGally)
library(reshape2)
## Warning: package 'reshape2' was built under R version 4.1.2
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
library(plotly)
## Warning: package 'plotly' was built under R version 4.1.2
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
# access the MASS package
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:plotly':
## 
##     select
## The following object is masked from 'package:dplyr':
## 
##     select
# load the data
data("Boston")

# explore the dataset
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14

Data set information

The data is Housing Values in Suburbs of Boston. It has 506 rows and 14 columns, containing information on the suburbs of Boston (towns). Here is the link to the data page, where one can see the full details.


Exercise 3: Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them.


Let’s visualize the data and take a look at the distributions of the variables and the relations between them.

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
# First let's make a long (melted) table from the Boston data
melt.boston <- melt(Boston)
## No id variables; using all as measure variables
# Then plot every variable, so we'll see the distributions of variables
ggplot(data = melt.boston, aes(x = value)) + 
stat_density() + 
facet_wrap(~variable, scales = "free")

# calculate the correlation matrix and round it
cor_matrix<-cor(Boston) %>% round(digits = 2)

# visualize the correlation matrix
corrplot(cor_matrix, method="circle", type="upper", cl.pos="b", tl.pos="d", tl.cex = 0.6)

There is significant variability in the data: some variables are heavily skewed towards the minimum or the maximum value (most of the values lie near or exactly at the minimum or maximum). There are also a few integer variables in the data, which do not fit the density plot above very well. A few of the variables seem approximately normally distributed.

It is clear that there are really strong correlations (both positive and negative) between variables in the data, as the correlation plot shows. In the plot the colour and the size of each circle indicate the sign and strength of the correlation between two variables.

Quite interestingly, the rad (accessibility to radial highways) and tax (property tax rate) variables have the largest positive correlations with the crime rate in towns. One might expect variables like dis (distance to employment centers) and lstat (percentage of lower-status population) to correlate more strongly with the crime rate. The plot also shows that rad and tax are heavily correlated with each other, so they behave almost like one variable.

We can also see that there seems to be significant variables like:

  • rad: correlates strongly with the crime rate
  • tax: correlates with both rad and the crime rate
  • indus (proportion of non-retail business acres, i.e. traditional industry): has many significant correlations with other key variables
  • nox (nitrogen oxides concentration): has many significant correlations with key variables; it seems hard to interpret on its own, but at least it correlates positively with indus
  • age (proportion of owner-occupied units built prior to 1940): old buildings are negatively correlated with distance to employment centers and positively with indus (more old buildings in more industrialized towns)
  • dis: many significant correlations with other variables; quite logically it is negatively correlated with indus, so traditional industry lies far from the main centers
  • lstat: many correlations with other variables, but perhaps not as significant

All in all, it is hard to gain much concrete information from this correlation analysis alone.
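Instead of eyeballing the corrplot, the same comparison can be made numerically. A small sketch ranking the other variables by the absolute value of their correlation with crim:

```r
# rank variables by the strength of their correlation with the crime rate
library(MASS)  # for the Boston data

crim_cor <- cor(Boston)[, "crim"]
ranked   <- sort(abs(crim_cor[names(crim_cor) != "crim"]), decreasing = TRUE)
round(ranked, 2)
# rad and tax come out on top, matching the reading of the plot above
```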


Data preparing and making training and test datasets


Exercise 4: Standardize the dataset and print out summaries of the scaled data. How did the variables change? Create a categorical variable of the crime rate in the Boston dataset (from the scaled crime rate). Use the quantiles as the break points in the categorical variable. Drop the old crime rate variable from the dataset. Divide the dataset to train and test sets, so that 80% of the data belongs to the train set.


Let’s scale the variables so that the mean = 0 and variance = 1 for all variables in the dataset:

# center and standardize variables
boston_scaled <- scale(Boston)

# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)

# summaries of the scaled variables
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865

We can now see that all the variables have both negative and positive values centred around zero, each with variance one.
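As a quick sanity check (a sketch, not part of the assignment), the scaling can also be verified numerically: after scale(), every column should have mean 0 and standard deviation 1.

```r
library(MASS)  # for the Boston data

# center and standardize the variables
boston_scaled <- as.data.frame(scale(Boston))

# means should be numerically zero and standard deviations exactly one
col_means <- colMeans(boston_scaled)
col_sds   <- apply(boston_scaled, 2, sd)
```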

# perform principal component analysis on the scaled data (prcomp uses the SVD method)
pca_boston <- prcomp(boston_scaled)

# summary of the PCA, used to compute the variance captured by each PC
s <- summary(pca_boston)

# rounded percentages of variance captured by each PC
pca_pr <- round(100*s$importance[2,], digits = 1)

# create object pc_lab to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")

# draw a biplot of the principal component representation and the original variables
biplot(pca_boston, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])

In the following section we create a new categorical variable from the scaled crim variable, with labels low, med_low, med_high and high, using the quantiles as break points so that each category contains 25 % of the observations.

# create a quantile vector of crim to use as break points
bins <- quantile(boston_scaled$crim)

# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))

# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)

# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)

After that we split the data into training and test sets by randomly picking 80 % of the rows for the training set and leaving the rest for the test set.

# number of rows in the Boston dataset 
n <- nrow(boston_scaled)

set.seed(123)

# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)

# create train set
train <- boston_scaled[ind,]

# create test set 
test <- boston_scaled[-ind,]

Classification - Fitting the Linear Discriminant Analysis model


Exercise 5: Fit the linear discriminant analysis on the train set. Use the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables. Draw the LDA (bi)plot.


Let’s fit the LDA model to training data and visualize results.

# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)

# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2549505 0.2500000 0.2500000 0.2450495 
## 
## Group means:
##                   zn      indus        chas        nox          rm        age
## low       1.01866313 -0.9066422 -0.08120770 -0.8835724  0.38666334 -0.9213895
## med_low  -0.07141687 -0.3429155  0.03952046 -0.5768545 -0.09882533 -0.3254849
## med_high -0.40148591  0.2162741  0.19544522  0.3637157  0.12480815  0.4564260
## high     -0.48724019  1.0149946 -0.03371693  1.0481437 -0.47733231  0.8274496
##                 dis        rad        tax    ptratio      black        lstat
## low       0.9094324 -0.6819317 -0.7458486 -0.4234280  0.3729083 -0.766883775
## med_low   0.3694282 -0.5395408 -0.4205644 -0.1079710  0.3103406 -0.164921798
## med_high -0.3720478 -0.4349280 -0.3191090 -0.2716959  0.1049654  0.009720124
## high     -0.8601246  1.6596029  1.5294129  0.8057784 -0.6383964  0.900379309
##                 medv
## low       0.47009410
## med_low   0.01548761
## med_high  0.19440747
## high     -0.71294711
## 
## Coefficients of linear discriminants:
##                  LD1         LD2        LD3
## zn       0.148390645  0.74870532 -1.0874785
## indus    0.040971465 -0.38126652 -0.1619456
## chas     0.002460776  0.03963849  0.1699312
## nox      0.312458150 -0.67564471 -1.3104018
## rm       0.011086947 -0.16058718 -0.1572603
## age      0.283482486 -0.38817624 -0.1013491
## dis     -0.079848789 -0.38493775  0.2108038
## rad      3.718978412  0.93123532 -0.4706522
## tax     -0.015197127  0.04230505  1.2889075
## ptratio  0.180294774 -0.01592588 -0.3558490
## black   -0.136724112  0.02948075  0.1288959
## lstat    0.145739238 -0.37823065  0.3345688
## medv     0.061327205 -0.53906503 -0.1509890
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9523 0.0364 0.0113
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "orange", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)

The discriminant analysis seems to fit the data quite well, as the plot indicates. The rad variable (index of accessibility to radial highways) and also zn (proportion of residential land zoned for lots over 25,000 sq.ft.) seem to have the biggest effect in the classification. The rad variable gives really good discrimination for the high-crime towns; for the lower-crime towns the separation is not nearly as clear. In the picture rad separates the high-crime towns from the rest.

Why, then, does rad (index of accessibility to radial highways) have such a significant effect? It is not obvious that good highway connections would affect the crime rate that much while other factors (like a lower-status population, industrialization or distance from employment centers) would not. Perhaps the population density is simply higher near the highways. In any case, rad seems to be a good indicator of a high crime rate.


Exercise 6: Save the crime categories from the test set and then remove the categorical crime variable from the test dataset. Then predict the classes with the LDA model on the test data. Cross tabulate the results with the crime categories from the test set. Comment on the results.


In the next phase we save the crime categories from the test set. We also make predictions on the test data and see how accurate the model is:

# save the correct classes from test data
correct_classes <- test$crime

# remove the crime variable from test data
test <- dplyr::select(test, -crime)

# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       10      10        4    0
##   med_low    2      17        6    0
##   med_high   1       9       13    2
##   high       0       0        1   27

The model’s predictive power seems reasonably good. There are just a few bigger inaccuracies where the correct class is two classes away from the prediction; most results end up close to the diagonal. Especially the high class seems to be “easy” to predict.
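The cross tabulation can be condensed into a single accuracy figure: the proportion of observations on the diagonal. A sketch that rebuilds the printed table by hand (the numbers are copied from the output above):

```r
# rebuild the printed confusion table and compute overall accuracy
lvls <- c("low", "med_low", "med_high", "high")
conf <- matrix(c(10, 10,  4,  0,
                  2, 17,  6,  0,
                  1,  9, 13,  2,
                  0,  0,  1, 27),
               nrow = 4, byrow = TRUE,
               dimnames = list(correct = lvls, predicted = lvls))

accuracy <- sum(diag(conf)) / sum(conf)
round(accuracy, 3)  # about 0.657, i.e. roughly two thirds classified correctly
```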


Clustering - Fitting the K-means model


Exercise 7: Reload the Boston dataset and standardize the dataset (we did not do this in the Datacamp exercises, but you should scale the variables to get comparable distances). Calculate the distances between the observations. Run k-means algorithm on the dataset. Investigate what is the optimal number of clusters and run the algorithm again. Visualize the clusters (for example with the pairs() or ggpairs() functions, where the clusters are separated with colors) and interpret the results.


# Load the data again and after that standardize all variables (mean = 0 and variance = 1)
data("Boston")

# center and standardize variables
boston_scaled <- scale(Boston)

# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)

Then, we calculate the distances and apply K-means algorithm to make clusters:

# euclidean distance matrix
dist_eu <- dist(boston_scaled)

# look at the summary of the distances
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
# k-means clustering
km <-kmeans(boston_scaled, centers = 3)

Finding the optimal value for the number of clusters:

set.seed(123)

# determine the number of clusters
k_max <- 10

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')

The optimal value of k is the one after which the WCSS (total within-cluster sum of squares) stops dropping radically; here the optimal number of clusters for K-means seems to be 2.
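The elbow can also be located numerically as the largest drop in total WCSS between consecutive values of k. A small sketch (nstart = 10 is an addition for stability, not part of the original chunk):

```r
# find the k after which the WCSS stops dropping radically
library(MASS)
set.seed(123)

boston_scaled <- as.data.frame(scale(Boston))
twcss <- sapply(1:10, function(k) kmeans(boston_scaled, k, nstart = 10)$tot.withinss)

k_elbow <- which.max(-diff(twcss)) + 1  # k with the biggest drop from k - 1
k_elbow
```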

# k-means clustering when k = 2
km <-kmeans(boston_scaled, centers = 2)

# plot the Boston scaled dataset with clusters
pairs(boston_scaled, col = km$cluster)

We can see that crim, indus, rad, tax and black are perhaps the variables that separate the two clusters best, so let’s plot only those variables.

# let's make a vector containing the significant separating variables
a_1 <- c(1, 3, 9, 10, 12)

# plot the Boston scaled dataset with clusters
pairs(boston_scaled[, a_1], col = km$cluster)

The selected variables seem to have the largest effect on the clustering (into two clusters), as the plot shows good separation between the black and pink groups.

Next, we draw a few more plots where one can see these significant variable values by cluster:

# let's draw few plots more where significant variable values are presented in a boxplot by clusters
par(mfrow = c(2, 3))  # a 2 x 3 grid fits all five boxplots

boxplot(boston_scaled$rad ~ km$cluster, main = "Rad by clusters")

boxplot(boston_scaled$tax ~ km$cluster, main = "Tax by clusters")

boxplot(boston_scaled$black ~ km$cluster, main = "Black by clusters")

boxplot(boston_scaled$indus ~ km$cluster, main = "Indus by clusters")

boxplot(boston_scaled$crim ~ km$cluster, main = "Crim by clusters")

Perhaps one does not become much wiser looking at these boxplots, because several of the picked clustering variables take only a few distinct values within a cluster (e.g. crim, indus, tax), so the separating power of any single variable can be criticized. But at least we got some nice pictures.


Classification of clusters (Bonus)


Bonus exercise: Perform k-means on the original Boston data with some reasonable number of clusters (> 2). Remember to standardize the dataset. Then perform LDA using the clusters as target classes. Include all the variables in the Boston data in the LDA model. Visualize the results with a biplot (include arrows representing the relationships of the original variables to the LDA solution). Interpret the results. Which variables are the most influencial linear separators for the clusters?


# Load the data again and after that standardize all variables (mean = 0 and variance = 1)
data("Boston")

# center and standardize variables
boston_scaled <- scale(Boston)

# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)

# k-means clustering
km <-kmeans(boston_scaled, centers = 5)

Let’s add the clusters to the data and then apply linear discriminant analysis using the clusters obtained from K-means as target classes:

# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, km$cluster)

set.seed(123)

# linear discriminant analysis
lda.fit_2 <- lda(km.cluster ~ ., data = boston_scaled)

# plot the lda results
plot(lda.fit_2, dimen = 2, col = boston_scaled$km.cluster, pch = boston_scaled$km.cluster)
lda.arrows(lda.fit_2, myscale = 1)

It seems clear that the chas and rad variables are the most influential linear separators for the five clusters. We handled rad before: highway accessibility seems to make a difference and separates towns. Chas indicates that the riverside towns are significantly different from the non-riverside towns.


3D-plotting (Super-Bonus)


model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)


plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = ~train$crime)
# k-means clustering
km_2 <-kmeans(matrix_product, centers = 4)

plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = ~km_2$cluster)

I am not entirely sure what the intention with the latter 3D plot was. I applied K-means (k = 4, the same number of groups as in the first plot) to the matrix product data (the LDA projection of the training data), intending to cluster the projected points. This latter K-means clustering makes the data look really smoothly separated: there are no obvious outliers or data points “inside” another group as in the first plot. Looking at the second plot, it appears much the same as if one plotted the predictions of the LDA model.


Dimensionality reduction techniques

Tuomas Keski-Kuha 2.12.2021


Data exploration and finding relations between variables


Exercise 1: Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them.


The human data originates from the United Nations Development Programme. See their data page for more information. For a nice overview see also the calculating the human development indices pdf.

In the data we have countries and their rankings based on the Human Development Index (HDI) and the Gender Inequality Index (GII). Both are aggregate indexes constructed from other measures. In addition to these indexes there are several other variables. In the analysis below we mainly use the variables explained here. A few words about the main indexes:

  • HDI: tries to capture population’s health, knowledge and standard of living (wealth). The HDI was created to emphasize that people and their capabilities should be the ultimate criteria for assessing the development of a country, not economic growth alone.
  • GII: tries to capture population’s health, empowerment, and labour market situation differences between sexes. The GII sheds new light on the position of women in 162 countries; it yields insights in gender gaps in major areas of human development.

In the following we explore the data with only a limited number of variables (numerical variables only).

# access a few libraries just in case: 
library(dplyr)
library(tidyr)
library(ggplot2)
library(corrplot)
library(GGally)
library(reshape2)
library(plotly)

# read the human data
#human <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/human2.txt", sep  =",", header = T)

# set the working directory to iods project folder
setwd("c:/Tuomas/Opiskelu/Open Data Science/IODS-project")

# read human from data wrangling exercise output
human <- read.csv(file = "data/human.csv", 
                      stringsAsFactors = FALSE)

# add countries as rownames
rownames(human) <- human$X

# remove the first variable which is the country
human <- human[, -1]

str(human)
## 'data.frame':    155 obs. of  8 variables:
##  $ Edu2.FM  : num  1.007 0.997 0.983 0.989 0.969 ...
##  $ Labo.FM  : num  0.891 0.819 0.825 0.884 0.829 ...
##  $ Life.Exp : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ Edu.Exp  : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ GNI      : int  64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
##  $ Mat.Mor  : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ Ado.Birth: num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ Parli.F  : num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
# print out summaries of the variables
summary(human)
##     Edu2.FM          Labo.FM          Life.Exp        Edu.Exp     
##  Min.   :0.1717   Min.   :0.1857   Min.   :49.00   Min.   : 5.40  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:66.30   1st Qu.:11.25  
##  Median :0.9375   Median :0.7535   Median :74.20   Median :13.50  
##  Mean   :0.8529   Mean   :0.7074   Mean   :71.65   Mean   :13.18  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:77.25   3rd Qu.:15.20  
##  Max.   :1.4967   Max.   :1.0380   Max.   :83.50   Max.   :20.20  
##       GNI            Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50
# let's draw ggpairs -plot to see an overview of the data

ggpairs(human, lower = list(combo = wrap("facethist", bins = 20)))

# compute the correlation matrix and visualize it with corrplot
cor(human) %>% corrplot

We see from the plots that most of the variables are fairly normally distributed, but there are exceptions like GNI (gross national income per capita) and Mat.Mor (maternal mortality ratio). This indicates that wealth and health differences between countries are very significant.

There are very strong negative and positive correlations between variables. A couple of the correlations have straightforward explanations, like life expectancy vs. expected years of schooling and life expectancy vs. maternal mortality. Let’s note the following:

  • GNI has significant correlations with almost all of the variables, except the GII variables (other than the educational ratio). It seems reasonable that GNI correlates positively with Edu.Exp (expected years of schooling) and Life.Exp (life expectancy at birth). Quite logically it also correlates negatively with Mat.Mor (maternal mortality) and Ado.Birth (adolescent birth rate, which perhaps measures knowledge in general as well as inequality between the sexes).
  • There is no correlation between the adolescent birth rate and women in power, which seems surprising. In fact women in power (Parli.F) has surprisingly weak correlations with the other variables, even with the educational ratio between the sexes. The labour force ratio between the sexes is not heavily correlated with the other variables either, whereas almost all of the other variables correlate quite significantly with one another.

Principal Component Analysis (PCA)


Exercise 2: Perform principal component analysis (PCA) on the not standardized human data. Show the variability captured by the principal components. Draw a biplot displaying the observations by the first two principal components (PC1 coordinate in x-axis, PC2 coordinate in y-axis), along with arrows representing the original variables.


Principal Component Analysis (PCA) can be performed by two slightly different matrix decomposition methods from linear algebra: the Eigenvalue Decomposition and the Singular Value Decomposition (SVD).

Both methods quite literally decompose a data matrix into a product of smaller matrices, which lets us extract the underlying principal components. This makes it possible to approximate a lower-dimensional representation of the data by choosing only a few principal components.
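The SVD connection can be verified directly: for a centered data matrix X with SVD X = UDV', the principal component scores returned by prcomp() equal UD, up to the sign of each component. A minimal sketch using the built-in USArrests data (an assumption, not the course dataset):

```r
# compare prcomp() scores with U %*% D from an explicit SVD
X <- scale(USArrests, center = TRUE, scale = FALSE)  # center only

sv  <- svd(X)
pca <- prcomp(USArrests)  # prcomp centers by default, no scaling

# scores agree up to a sign flip per component
max_diff <- max(abs(abs(sv$u %*% diag(sv$d)) - abs(pca$x)))
max_diff  # numerically zero
```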

# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human)

# draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped


Here the results are dominated by GNI, which has by far the largest absolute variance: because the data was not standardized, the GNI values are on a completely different scale than the other variables. Standardization seems like a good plan.


Exercise 3: Standardize the variables in the human data and repeat the above analysis. Interpret the results of both analysis (with and without standardizing). Are the results different? Why or why not? Include captions (brief descriptions) in your plots where you describe the results by using not just your variable names, but the actual phenomenons they relate to.


# standardize the variables
human_std <- scale(human)

# perform principal component analysis to the std. data (with the SVD method)
pca_human_std <- prcomp(human_std)

# define summaries so that we can compute the variance captured by PC1 and PC2 (used in the biplot labels) for both analyses, standardized and non-standardized

s <- summary(pca_human)

s_std <- summary(pca_human_std)

# rounded percentages of variance captured by each PC

pca_pr <- round(100*s$importance[2,], digits = 1)

pca_pr_std <- round(100*s_std$importance[2,], digits = 1) 

# create object pc_lab to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")

pc_lab_std <- paste0(names(pca_pr_std), " (", pca_pr_std, "%)")

# draw biplots of the principal component representation and the original variables both for non standardized data and standardized

biplot(pca_human, main = "PCA on non-standardized data", cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

biplot(pca_human_std, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab_std[1], ylab = pc_lab_std[2], main = "PCA on standardized data")

Standardizing makes the PCA far more informative, and the arrows for the variables now have clear directions. In the PCA on the non-standardized data, PC1 was dominated by a single variable, GNI, and explained essentially 100 % of the variability in the country data. The latter plot is therefore completely different, both in the arrows and in the variance captured by the components (PC1 explains 53.6 % and PC2 16.2 % of the variability). The countries are also far more evenly spread in the plot; in the non-standardized analysis most of the countries sat in the top right, and the model essentially just separated rich countries from the rest, placing some very poor countries to the lower right.

The effect of standardization shows up nicely in the lengths of the arrows (which reflect the standard deviations of the variables): in the latter plot every variable has standard deviation one, so all arrows have almost the same length.
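As a quick check of what scaling does (a minimal base-R sketch with toy data, not the human dataset itself), `scale()` centers each column to mean 0 and scales it to standard deviation 1, which is why the arrows end up almost equally long:

```r
# toy data with wildly different scales, mimicking e.g. GNI vs. a ratio variable
set.seed(1)
toy <- data.frame(gni  = rnorm(50, mean = 20000, sd = 15000),
                  rate = rnorm(50, mean = 0.5,   sd = 0.1))

# standardize: subtract the column mean, divide by the column sd
toy_std <- scale(toy)

# every standardized column now has mean ~0 and sd exactly 1
colMeans(toy_std)
apply(toy_std, 2, sd)
```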

It is harder to say what the two plots have in common. The GNI arrow can be seen in both, and the same very rich countries stand out in both plots (traditional oil countries such as Qatar, Kuwait and the United Arab Emirates).


Exercise 4: Give your personal interpretations of the first two principal component dimensions based on the biplot drawn after PCA on the standardized human data.


It looks like health, wealth and knowledge differences between countries are captured by PC1 (variables like GII, Life.Exp and Mat.Mor). The rest of the variability in the data is explained by gender-based factors such as labour-force differences (Labo.FM) and women in power (Parli.F). The independence between these gender-based variables and the others is easy to see in the plot above, as the two arrow groups are approximately orthogonal (roughly 90 degrees between them).

The correlations between the “PC1” variables (both positive and negative) are also presented really well here: arrows pointing in the same direction indicate positive correlation, and arrows pointing in opposite directions negative correlation. These variables correlate strongly with each other and behave as a group with respect to the variability between countries.

In the plot, rich countries are on the left and poor countries on the right. Countries towards the top are more equal between the sexes, and countries towards the bottom less so. Rwanda is an interesting outlier: it scores high on the gender dimension even though it is quite an underdeveloped country due to its dark history. Perhaps women are more involved in the labour force there, for example in agriculture.


Multiple Correspondence Analysis (MCA)


Exercise 5: Load the tea dataset from the package Factominer. Explore the data briefly: look at the structure and the dimensions of the data and visualize it. Then do Multiple Correspondence Analysis on the tea data (or to a certain columns of the data, it’s up to you). Interpret the results of the MCA and draw at least the variable biplot of the analysis. You can also explore other plotting options for MCA. Comment on the output of the plots.

Multiple Correspondence Analysis (MCA) is a method for analyzing qualitative data, and it is an extension of Correspondence Analysis (CA). MCA can be used to detect patterns or structure in the data as well as for dimension reduction. In other words, it tries to compress the information in the categorical variables into a few continuous dimensions, so that individual observations can in a way be grouped.
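Conceptually, MCA is correspondence analysis applied to the indicator (0/1 dummy) matrix of the categorical variables. A minimal base-R sketch of that indicator matrix, on made-up answers rather than the tea data:

```r
# toy questionnaire answers: two categorical variables
ans <- data.frame(sugar = factor(c("sugar", "No.sugar", "sugar")),
                  where = factor(c("chain store", "tea shop", "tea shop")))

# full indicator matrix: one 0/1 column per category level
# (contrasts = FALSE keeps all levels instead of dropping a reference level)
ind <- model.matrix(~ . - 1, data = ans,
                    contrasts.arg = lapply(ans, contrasts, contrasts = FALSE))

# each row has exactly one active level per variable,
# so every row sums to the number of variables
rowSums(ind)
```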

Tea-data

The data used here come from a questionnaire on tea: 300 individuals were asked how they drink tea (18 questions), what their perception of the product is (12 questions) and some personal details (4 questions).

# access a FactoMineR package and also factoextra: 
library(FactoMineR)
## Warning: package 'FactoMineR' was built under R version 4.1.2
library(factoextra)
## Warning: package 'factoextra' was built under R version 4.1.2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
##read tea-data
data("tea")

# look at the structure of tea
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
# visualize the tea data in three plots
gather(tea[, 1:12]) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 6))
## Warning: attributes are not identical across measure variables;
## they will be dropped

gather(tea[, 13:24]) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 6))
## Warning: attributes are not identical across measure variables;
## they will be dropped

gather(tea[, 25:36]) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 6))
## Warning: attributes are not identical across measure variables;
## they will be dropped

We can see that all variables but one (age) are categorical, so simply counting cases in the plots is a suitable way to visualize them.
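The bar heights in those panels are just level counts; the same numbers can be computed directly with `table()`, and `sapply(..., is.factor)` shows which columns are categorical (a small base-R sketch on toy data, since the real check would run on `tea`):

```r
# toy data frame shaped like the tea data: a factor plus one integer column
toy <- data.frame(breakfast = factor(c("breakfast", "Not.breakfast", "breakfast")),
                  age       = c(39L, 45L, 47L))

# which columns are categorical?
sapply(toy, is.factor)

# the counts a geom_bar() panel would display
table(toy$breakfast)
```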

In the following sections we do two MCAs (this is not exactly what was asked):

  • With all variables in the data
  • With selected variables in the data

There was quite a lot of ready-made and handy R code for these plots on the web, and I used it without any hesitation in the MCA analysis.


MCA on the tea data with all variables


In the MCA that follows, the first 18 questions in the tea data are active variables, the 19th (age) is a supplementary quantitative variable, and the remaining variables are supplementary categorical variables. Supplementary variables are not used to build the MCA itself, so applying “all variables” to the MCA really means the 18 active ones.

# let's run MCA first on all of the categorical variables and then decide which variables to use in further analysis (the most effective variables can be seen in the variable representation for the first two MCA dimensions). We don't use graphs = FALSE here, so we get the graphs from the analysis
res.mca=MCA(tea,quanti.sup=19,quali.sup=20:36)

## Warning: ggrepel: 22 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

Active variables are shown in red and supplementary variables in green in the first plot above. The second MCA factor map plots the individuals (the whole data) on MCA dimensions 1 and 2. In factor maps the distance between any two row points (individuals) or column points (variable values) gives a measure of their similarity (or dissimilarity): row points with a similar profile are close to each other on the factor map, and the same holds for column points.

Comments on factor maps of two first dimensions:

  • A few variable values (tea shop, unpackaged, p_upscale) form a “group” of more similar individuals. In the factor maps these categorical values far at the top “stretch” individuals towards them.
  • There are also variable values gathered on the right of the picture (tearoom, other etc.) which seem to pull observations away from the origin towards them, forming a loose group of more similar individuals.
  • There are lots of individuals near the origin, and they form a third group.

The variables representation gives the correlations of the variables with Dim 1 and Dim 2: where, how and price are the most relevant for the first two dimensions.

There could be more variable values that separate individuals in higher dimensions as well, but these first dimensions explain more of the variation in the data. We can also look at the share of variation that each dimension retains:

# To visualize the percentages of inertia (variation) explained by each MCA dimensions, use the function fviz_eig() or fviz_screeplot() [factoextra package]:
fviz_screeplot(res.mca, addlabels = TRUE, ylim = c(0, 11))

In this plot we can see that the first two dimensions are more relevant than the others, but together they explain only ~18 % of the variation.

We can also calculate the contributions of the categories (variable values) to the first three dimensions and plot the whole variable set as a heatmap-like plot where the contributions are easy to see.

# Let's calculate the contributions of the categories to the first three dimensions
# Contributions of variables to DIM 1
fviz_contrib(res.mca, choice = "var", axes = 1, top = 10)

# Contributions of variables to DIM 2
fviz_contrib(res.mca, choice = "var", axes = 2, top = 10)

# Contributions of variables to DIM 3
fviz_contrib(res.mca, choice = "var", axes = 3, top = 10)

# Heat-map for contribtions of the first two dimensions
fviz_mca_var(res.mca, col.var = "contrib",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"), 
             repel = TRUE, # avoid text overlapping (slow)
             ggtheme = theme_minimal()
             )
## Warning: ggrepel: 26 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

Let’s pick the most significant variables for dimensions 1 and 2 from the variable categories plot for the next MCA analysis. Variables near the origin are not significant for the first two dimensions. Let’s also pick some variables which are significant for dimension 3.


MCA on the tea data with selected variables


Perhaps the whole analysis (if we want to focus on the first two dimensions) would be just as good with only three variables (price, where, how), but let’s keep a few more (tearoom, friends, resto, Tea, How) in the second analysis below.

# We pick the most relevant variables from the analysis of the whole data and do the MCA again with those variables

keep_columns <- c("where", "tearoom", "friends", "how", "price", "Tea", "resto", "How")

# select the 'keep_columns' to create a new dataset
tea_time <- tea[, keep_columns]

# visualize the dataset
gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped

# multiple correspondence analysis on the tea_time dataset (a subset, not the whole data)
mca <- MCA(tea_time, graph = TRUE)

# summary of the model
summary(mca)
## 
## Call:
## MCA(X = tea_time, graph = TRUE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.289   0.255   0.168   0.150   0.134   0.132   0.128
## % of var.             13.598  12.005   7.893   7.039   6.327   6.227   6.005
## Cumulative % of var.  13.598  25.603  33.497  40.536  46.863  53.091  59.096
##                        Dim.8   Dim.9  Dim.10  Dim.11  Dim.12  Dim.13  Dim.14
## Variance               0.124   0.111   0.105   0.100   0.095   0.085   0.078
## % of var.              5.840   5.204   4.931   4.723   4.488   4.013   3.671
## Cumulative % of var.  64.935  70.140  75.070  79.793  84.282  88.295  91.966
##                       Dim.15  Dim.16  Dim.17
## Variance               0.066   0.063   0.041
## % of var.              3.122   2.970   1.942
## Cumulative % of var.  95.088  98.058 100.000
## 
## Individuals (the 10 first)
##                         Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                    | -0.629  0.456  0.101 |  0.079  0.008  0.002 |  0.910
## 2                    | -0.387  0.173  0.098 |  0.001  0.000  0.000 |  0.593
## 3                    | -0.109  0.014  0.012 | -0.457  0.273  0.218 | -0.137
## 4                    | -0.459  0.243  0.256 | -0.036  0.002  0.002 | -0.186
## 5                    | -0.459  0.243  0.256 | -0.036  0.002  0.002 | -0.186
## 6                    | -0.731  0.616  0.235 |  0.089  0.010  0.003 |  0.082
## 7                    | -0.277  0.088  0.117 | -0.205  0.055  0.064 | -0.369
## 8                    | -0.387  0.173  0.098 |  0.001  0.000  0.000 |  0.593
## 9                    |  0.376  0.163  0.069 |  0.162  0.034  0.013 |  0.155
## 10                   |  0.807  0.751  0.264 |  0.291  0.111  0.034 |  0.330
##                         ctr   cos2  
## 1                     1.645  0.211 |
## 2                     0.698  0.228 |
## 3                     0.038  0.020 |
## 4                     0.069  0.042 |
## 5                     0.069  0.042 |
## 6                     0.013  0.003 |
## 7                     0.270  0.208 |
## 8                     0.698  0.228 |
## 9                     0.048  0.012 |
## 10                    0.217  0.044 |
## 
## Categories (the 10 first)
##                          Dim.1     ctr    cos2  v.test     Dim.2     ctr
## chain store          |  -0.590   9.653   0.620 -13.614 |  -0.139   0.610
## chain store+tea shop |   1.114  13.954   0.436  11.417 |  -0.519   3.434
## tea shop             |   0.883   3.373   0.087   5.090 |   2.243  24.645
## Not.tearoom          |  -0.298   3.094   0.370 -10.517 |   0.073   0.210
## tearoom              |   1.242  12.910   0.370  10.517 |  -0.304   0.876
## friends              |   0.272   2.085   0.139   6.448 |  -0.236   1.788
## Not.friends          |  -0.512   3.930   0.139  -6.448 |   0.445   3.371
## tea bag              |  -0.608   9.073   0.484 -12.030 |  -0.156   0.678
## tea bag+unpackaged   |   0.761   7.854   0.264   8.892 |  -0.410   2.586
## unpackaged           |   0.885   4.069   0.107   5.653 |   1.810  19.257
##                         cos2  v.test     Dim.3     ctr    cos2  v.test  
## chain store            0.035  -3.216 |   0.076   0.277   0.010   1.758 |
## chain store+tea shop   0.095  -5.322 |  -0.146   0.411   0.007  -1.493 |
## tea shop               0.559  12.927 |  -0.109   0.089   0.001  -0.630 |
## Not.tearoom            0.022   2.574 |  -0.052   0.165   0.011  -1.850 |
## tearoom                0.022  -2.574 |   0.219   0.688   0.011   1.850 |
## friends                0.105  -5.611 |  -0.207   2.088   0.081  -4.915 |
## Not.friends            0.105   5.611 |   0.390   3.934   0.081   4.915 |
## tea bag                0.032  -3.091 |   0.160   1.083   0.034   3.167 |
## tea bag+unpackaged     0.077  -4.794 |  -0.375   3.282   0.064  -4.379 |
## unpackaged             0.447  11.556 |   0.223   0.444   0.007   1.422 |
## 
## Categorical variables (eta2)
##                        Dim.1 Dim.2 Dim.3  
## where                | 0.624 0.586 0.010 |
## tearoom              | 0.370 0.022 0.011 |
## friends              | 0.139 0.105 0.081 |
## how                  | 0.485 0.460 0.065 |
## price                | 0.421 0.414 0.254 |
## Tea                  | 0.035 0.174 0.428 |
## resto                | 0.101 0.202 0.111 |
## How                  | 0.136 0.078 0.382 |

From the Categorical variables part of the summary we can see the squared correlations (eta2) between each variable and the dimensions. A value close to one indicates a strong link between the variable and the dimension. Here the most important variables for each dimension are:

  • Dim 1: where, how, price and tearoom
  • Dim 2: where, how and price
  • Dim 3: Tea and How

These are largely the same variables that stood out in the previous MCA model, where all variables were used.

In the next plot we see the percentage of the variance in the data that each dimension retains in the analysis.

# To visualize the percentages of inertia (variation) explained by each MCA dimensions, use the function fviz_eig() or fviz_screeplot() [factoextra package]:
fviz_screeplot(mca, addlabels = TRUE, ylim = c(0, 15))

In this plot we can see that the first two dimensions are more relevant than the others, but still they explain only ~25 % of the variation.
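The ~25 % figure is simply the cumulative sum of the per-dimension percentages reported in the eigenvalue table of the model summary above; a one-line check in R:

```r
# % of variance per dimension, copied from the MCA eigenvalue table above
pct <- c(13.598, 12.005, 7.893, 7.039, 6.327)

# cumulative share retained by the first k dimensions
cumsum(pct)
# the first two dimensions together retain 25.603 %
cumsum(pct)[2]
```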

Let’s plot the results on a factor map where the variables are shown in different colours. In the factor map we can see how similar (small distance) or dissimilar (large distance) the categories are. In the second factor map we can also see the contributions to the first two dimensions:

# visualize MCA
plot(mca, invisible=c("ind"), habillage = "quali")

# heat map for contributions:
fviz_mca_var(mca, col.var = "contrib",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"), 
             repel = TRUE, # avoid text overlapping (slow)
             ggtheme = theme_minimal()
             )

Comments:

  • As in the previous analysis, where, how, price and exactly those three values (tea shop, unpackaged, p_upscale) form a group of more similar individuals.
  • tearoom, chain store+tea shop and other also seem to draw individuals to the right and form a loose group.
  • It is interesting that How = other does not contribute to the first two dimensions. Perhaps this is because it has just 9 individual observations, so it cannot affect a large number of individuals in the analysis. Another explanation could be that it has greater significance in higher dimensions, and this is indeed true (see the Categorical variables table in the summary), so it seems that it is not correlated that heavily with the other variables which are significant in the first two dimensions.
#let's look at How-variable values: 
summary(tea_time$How)
## alone lemon  milk other 
##   195    33    63     9

All in all, I think that for the first two dimensions three groups can be seen (with fairly straightforward discrimination; some might challenge these conclusions):

  • there are “high-end” tea drinkers, who more or less all drink expensive tea bought from a tea shop (price = p_upscale, how = unpackaged, where = tea shop); they perhaps drink tea in tearooms but also elsewhere.
  • there are “tea room” tea drinkers, who buy tea from tea shops and also from chain stores, but especially like to drink tea in a tearoom.
  • “regular” tea drinkers, who use inexpensive tea bags bought from a chain store and are not that interested in how the tea is packaged. They do not drink tea much in tearooms.

We can also draw an extra plot with factor maps for some key variables. The plot draws confidence ellipses for the variable values (the individuals and their answers to these questions can be seen here as well), which indicate the statistical significance of differences between variable values. If the ellipses cross, there is no statistically significant difference between the variable values; on the other hand, if the ellipses are far from each other, the variable values really do differ statistically.

fviz_ellipses(mca, c("where", "how", "price", "How"),
              geom = "point")

Few comments:

  • The how and where variables form quite distinct groups for all variable values, and they actually look really similar to each other, so these values seem strongly correlated.
  • The price variable also seems correlated to some extent with how and where, but for price only the p_upscale value seems different from the rest of the values.
  • The How variable is more uncertain. There seem to be some differences, but the differences in the picture are quite small after all. The value other is distinct but really uncertain, perhaps due to its small number of observations.

Comments on the exercises and learning


This week was really fun and I took quite a few liberties doing the exercises (especially in the MCA part). Of course my understanding of the methods is limited, but there was helpful material on the web.


Analysis of longitudinal data

Tuomas Keski-Kuha 12.12.2021

# access libraries: 
library(dplyr)
library(tidyr)
library(ggplot2)
library(corrplot)
library(GGally)
library(reshape2)
library(plotly)
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.1.2
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v tibble  3.1.4     v stringr 1.4.0
## v readr   2.1.0     v forcats 0.5.1
## v purrr   0.3.4
## Warning: package 'readr' was built under R version 4.1.2
## Warning: package 'stringr' was built under R version 4.1.2
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x plotly::filter() masks dplyr::filter(), stats::filter()
## x dplyr::lag()     masks stats::lag()
## x MASS::select()   masks plotly::select(), dplyr::select()
library(rstatix)
## Warning: package 'rstatix' was built under R version 4.1.2
## 
## Attaching package: 'rstatix'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:stats':
## 
##     filter
library(ggpubr)
## Warning: package 'ggpubr' was built under R version 4.1.2
library(lme4)
## Warning: package 'lme4' was built under R version 4.1.2
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack

Graphical Displays and Summary Measure Approach for RATS data


Exercise 1: Implement the analyses of Chapter 8 of MABS using the RATS data. (0-7 points: 0-4 points for graphs or analysis results + 0-3 points for their interpretations)


Rats, it’s RATS - Graphical Displays of RATS


Graphical displays of data are almost always useful for exposing patterns in the data, particularly when these are unexpected; this might be of great help in suggesting which class of models might be most sensibly applied in the later more formal analysis.

In the following we have the RATS data, which contains rat weight development under different diets. The data has three groups of rats put on different diets, and each rat’s (16 rats in total) weight in grams was measured repeatedly over a 9-week period.

The goal of the study was to determine whether these diets produce different growth responses in the rats. Let’s first load the data (in long form) and draw a plot where all the information can be seen.

# load the data

# set the working directory to iods project folder
setwd("c:/Tuomas/Opiskelu/Open Data Science/IODS-project")
RATSL <- read.csv(file = "data/RATSL.csv")

RATSL <- within(RATSL, {
  ID <- factor(ID)
  Group <- factor(Group)
})

#  read original RATS data
RATS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt", sep  ="\t", header = T)

# changing categorical variables to factors
RATS <- within(RATS, {
  ID <- factor(ID)
  Group <- factor(Group)
})

# draw a plot of the rat data
ggplot(RATSL, aes(x = Time, y = Weight, group = ID)) +
  geom_line(aes(linetype = Group)) +
  scale_x_continuous(name = "Time (days)", breaks = seq(0, 60, 10)) +
  scale_y_continuous(name = "Weight (grams)") +
  theme(legend.position = "top")

Comments:

  • First of all, the starting weights (baselines) are quite different in the different rat groups. Group 2 and 3 rats are much heavier than group 1 rats at the start and throughout the study.
  • Groups 2 and 3 have fewer rats (4 each) than the first group (8 rats). The variety of baselines in groups 2 and 3 is also larger than in group 1.
  • It looks like the weights of almost all rats grow during the study period.
  • At first glance, group 1 development seems more stable than in the other groups, where the weights seem to grow more rapidly.

Let’s standardize all the rat weights in the next part and see whether more differences between the groups can be found.

# Standardise the scores for RATS
RATSL_std <- RATSL %>%
  group_by(Time) %>%
  mutate( stdrats = (Weight - mean(Weight))/sd(Weight) ) %>%   ungroup()

glimpse(RATSL_std)
## Rows: 176
## Columns: 6
## $ ID      <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2, 3~
## $ Group   <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, 1,~
## $ WD      <chr> "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1",~
## $ Weight  <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, 47~
## $ Time    <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8, 8,~
## $ stdrats <dbl> -1.0011429, -1.1203857, -0.9613953, -0.8421525, -0.8819001, -0~
p1 <- ggplot(RATSL_std, aes(x = Time, y = stdrats, linetype = ID))
p2 <- p1 + geom_line() + scale_linetype_manual(values = rep(1:10, times=4))
p3 <- p2 + facet_grid(. ~ Group, labeller = label_both)
p4 <- p3 + theme_bw() + theme(legend.position = "none")
p5 <- p4 + theme(panel.grid.minor.y = element_blank())
p6 <- p5 + scale_y_continuous(name = "standardized rats")
p6

Remarks:

  • All the group 1 rats are packed so tightly that it is not easy to see the differences. Their growth patterns seem more similar than in the other groups.
  • In group 2 there is one rat which weighs much more than the others in that group. There is also one rat in group 2 whose development is quite different from the others (it does not grow as fast as the rest); that rat really is the odd one out in its group in terms of growth pattern.
  • In group 3 there is one rat which actually grows faster than the others.

In the next chapter we calculate summary measures for each group and compare differences between groups.


Summary Measure Approach of RATS


In the first plot we draw the group mean weights and the standard errors around the mean (mean +/- standard error).

# Number of weeks, baseline (week 0) included
n <- RATSL$Time %>% unique() %>% length()


# Make a summary data:
RATSS <- RATSL %>%
  group_by(Group, Time) %>%
  summarise( mean=mean(Weight), se=sd(Weight)/sqrt(n)) %>%
  ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the `.groups` argument.
p1 <- ggplot(RATSS, aes(x = Time, y = mean, linetype = Group, shape = Group))
p2 <- p1 + geom_line() + scale_linetype_manual(values = c(1,2,3))
p3 <- p2 + geom_point(size=3) + scale_shape_manual(values = c(1,2,3))
p4 <- p3 + geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="1"), width=1.5)
p5 <- p4 + theme_bw() + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
p6 <- p5 + theme(legend.position = c(0.9,0.45))
p7 <- p6 + scale_y_continuous(name = "mean(Weight) +/- se(Weight)")
p7

In the plot above:

  • The means of the groups seem to develop differently. Groups 2 and 3 seem to grow more rapidly than group 1 if we look at the means of the rat weights.
  • The standard error around the mean mainly tells us that there is more weight variation within groups 2 and 3. It is not that interesting and doesn’t actually tell us anything about variation in the growth pattern, if there is any within the groups.

Next we’ll focus on the growth patterns of the rats. Let’s calculate the growth percentage for every rat at each measuring point and check whether there is anything interesting there. After calculating the growth percentages, we simply take each rat’s mean growth percentage (over all measuring points) by group and plot them in a boxplot.

# Let's use the wide data in which it's perhaps easier to calculate the growth percents

RAT_g <- RATS

# we need a loop (over columns, i.e. time points) to calculate the growth percents for all the rats at all measuring points
for (i in 4:13) {
  RAT_g[, i] = (RATS[, i]-RATS[, i-1])/RATS[, i-1]*100
  
}

# let's drop the baseline (first weight)
RAT_g <- RAT_g[, -3]

# modify RAT_g to long data
RAT_gL <- gather(RAT_g, key = WD, value = grow, -ID, -Group) %>%
  mutate(Time = as.integer(substr(WD,3,4)))

# Number of weeks, baseline (week 0) included
n <- RAT_gL$Time %>% unique() %>% length()

# Make a summary data:
#RAT_gS <- RAT_gL %>%
#  group_by(Group, Time) %>%
#  summarise( mean=mean(grow), se=sd(grow)/sqrt(n)) %>%
#  ungroup()

# let's calculate the means of the growth percents for all individual rats
RAT_gS1 <- RAT_gL %>%
  group_by(Group, ID) %>%
  summarise( mean=mean(grow) ) %>%
  ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the `.groups` argument.
# plot the mean growth percents
ggplot(RAT_gS1, aes(x = Group, y = mean)) +
  geom_boxplot() +
  stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "blue") +
  scale_y_continuous(name = "mean(growth percent)")

This plot perhaps shows something interesting in the data, though of course the variability in group 1 differs partly just because there are more rats in that group. Here we perhaps see that the mean of group 2 differs from the other groups if we focus only on the growth pattern, but it’s hard to make strong statements based on this alone. Perhaps one could also compare the growth percentages for each time point separately.
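The per-time-point comparison hinted at above could be sketched like this (a hedged base-R sketch on toy data; the real RATS columns are assumed to be named WD1, WD8, WD15, ... as in the wide data):

```r
# toy wide-format data: 3 rats, 2 groups, weights at three time points
toy <- data.frame(ID = factor(1:3), Group = factor(c(1, 1, 2)),
                  WD1  = c(240, 225, 410),
                  WD8  = c(250, 230, 415),
                  WD15 = c(255, 240, 428))

# growth percent between consecutive measurements, per rat
w <- as.matrix(toy[, c("WD1", "WD8", "WD15")])
growth <- (w[, -1] - w[, -ncol(w)]) / w[, -ncol(w)] * 100

# mean growth percent per group at each time step
aggregate(as.data.frame(growth), by = list(Group = toy$Group), FUN = mean)
```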

Let’s also look at some statistical calculations on the summarised data. We take each rat’s baseline into a linear model to check how relevant a factor it is. We’ll also fit a linear model without the baseline as a predictor to see the difference.

# Create a summary data by group and id with mean as the summary variable (ignoring baseline week 0).
RATSL8S <- RATSL %>%
  filter(Time > 0) %>%
  group_by(Group, ID) %>%
  summarise( mean=mean(Weight) ) %>%
  ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the `.groups` argument.
# Add the baseline from the original data as a new variable to the summary data
RATSL8S2 <- RATSL8S %>%
  mutate(baseline = RATS$WD1)

# Fit the linear model with the mean as the response 
fit_1 <- lm(mean ~ Group, data = RATSL8S2)

summary(fit_1)
## 
## Call:
## lm(formula = mean ~ Group, data = RATSL8S2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -43.886 -27.031   2.284  10.588 105.750 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   263.72      12.96  20.346 3.06e-11 ***
## Group2        220.99      22.45   9.844 2.16e-07 ***
## Group3        262.08      22.45  11.674 2.91e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 36.66 on 13 degrees of freedom
## Multiple R-squared:  0.9313, Adjusted R-squared:  0.9207 
## F-statistic: 88.07 on 2 and 13 DF,  p-value: 2.763e-08
# Fit the linear model with the mean as the response with baseline as predictor 
fit_2 <- lm(mean ~ baseline + Group, data = RATSL8S2)

summary(fit_2)
## 
## Call:
## lm(formula = mean ~ baseline + Group, data = RATSL8S2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -21.732  -3.812   1.991   6.889  13.455 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 30.14886   19.88779   1.516   0.1554    
## baseline     0.93194    0.07793  11.959 5.02e-08 ***
## Group2      31.68866   17.11189   1.852   0.0888 .  
## Group3      21.52296   21.13931   1.018   0.3287    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.62 on 12 degrees of freedom
## Multiple R-squared:  0.9947, Adjusted R-squared:  0.9933 
## F-statistic: 747.8 on 3 and 12 DF,  p-value: 6.636e-14
# Compute the analysis of variance for the fitted models with anova()
anova(fit_2, fit_1)
## Analysis of Variance Table
## 
## Model 1: mean ~ baseline + Group
## Model 2: mean ~ Group
##   Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
## 1     12  1352.4                                  
## 2     13 17471.4 -1    -16119 143.02 5.023e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From these analyses we can see that:

  • Without the baseline, the group variable seems to be a reasonable predictor of the means, but the residual standard errors show that the baseline is a much better predictor: the residual standard error is 10.62 with the baseline versus 36.66 without it. Still, the R-squared of 0.93 for the model without the baseline is not bad at all, so that model also captures most of the variability in the data.
  • The mean weight of an individual rat is highly correlated with its baseline, which is plausible given the profile plots above (the individual lines do not grow fast or vary much).
  • The first model (without the baseline predictor) also implicitly carries some baseline information, since the groups clearly start at different weight levels.
  • The analysis of variance (of the residual variation) also shows that the model with the baseline predictor has significantly better predictive power. RSS is the residual sum of squares.
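As a sanity check on the ANOVA table above, the F-statistic it reports can be reproduced by hand from the printed residual sums of squares (1352.4 on 12 degrees of freedom for the baseline model, 17471.4 for the model without the baseline); a minimal sketch in base R, using the rounded numbers as printed:

```r
# Residual sums of squares and degrees of freedom from the anova() output above
rss_baseline <- 1352.4   # mean ~ baseline + Group, 12 residual df
rss_group    <- 17471.4  # mean ~ Group, 13 residual df

# F = (increase in RSS per dropped parameter) / (residual mean square of the fuller model)
f_stat <- ((rss_group - rss_baseline) / 1) / (rss_baseline / 12)
p_val  <- pf(f_stat, df1 = 1, df2 = 12, lower.tail = FALSE)

round(f_stat, 1)  # ~143, matching the anova() output up to rounding
```

The tiny p-value from `pf()` agrees with the conclusion above: dropping the baseline predictor significantly worsens the fit.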

Linear Mixed Effects Models for Normal Response Variables for BPRS data


Exercise 2: Implement the analyses of Chapter 9 of MABS using the BPRS data. (0-8 points: 0-4 points for graphs or analysis results + 0-4 points for their interpretations)


In this chapter we’ll fit several models that, as the title suggests, are all linear mixed effects models: linear regression with added parameters and features. The aim is a better model for repeated-measures data than basic linear regression, which assumes that all observations are independent. Here are the four models that we’ll apply (abbreviations in brackets):

  • (Basic) linear regression model (LRM)
  • Random intercept model (RIM)
  • Random intercept and random slope model (SLO)
  • Random intercept and random slope model with interaction (INT)
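As an overview, the four models differ only in their model formulas. A sketch of the formulas in R syntax (the random-effect terms use the lme4 convention; the actual fits appear below):

```r
# The four model formulas; only the terms in parentheses (random effects) change
models <- list(
  LRM = bprs ~ week + treatment,                    # plain linear regression
  RIM = bprs ~ week + treatment + (1 | subject),    # subject-specific intercept
  SLO = bprs ~ week + treatment + (week | subject), # + subject-specific slope
  INT = bprs ~ week * treatment + (week | subject)  # + week x treatment interaction
)
names(models)
```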

The data we use is the BPRS data: 40 male subjects were randomly assigned to one of two treatment groups, and each subject was rated on the brief psychiatric rating scale (BPRS) before treatment began (week 0) and then at weekly intervals for eight weeks. The BPRS assesses 18 symptom constructs such as hostility, suspiciousness, hallucinations and grandiosity; each of these is rated from one (not present) to seven (extremely severe). The scale is used to evaluate patients suspected of having schizophrenia.

Let’s load the BPRS data, take a glimpse, and draw a plot of it.

# load the data

# set the working directory to iods project folder
setwd("c:/Tuomas/Opiskelu/Open Data Science/IODS-project")
BPRSL <- read.csv(file = "data/BPRSL.csv")

BPRSL <- within(BPRSL, {
  treatment <- factor(treatment)
  subject <- factor(subject)
})

# Glimpse the data

glimpse(BPRSL)
## Rows: 360
## Columns: 5
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
## $ subject   <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1~
## $ weeks     <chr> "week0", "week0", "week0", "week0", "week0", "week0", "week0~
## $ bprs      <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, 38, 66, ~
## $ week      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
# Draw the plot
ggplot(BPRSL, aes(x = week, y = bprs, linetype = subject)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ treatment, labeller = label_both) +
  theme(legend.position = "none") + 
  scale_y_continuous(limits = c(min(BPRSL$bprs), max(BPRSL$bprs)))

Comments on the data:

  • Both treatment groups have 20 patients each.
  • The variability and the overall pattern look quite similar for both groups; treatment 2 perhaps shows a bit more variability and has one patient with a very high value at the start. If we were analysing actual values rather than just the pattern, it might be sensible to drop that observation from the analysis.
  • The starting level also seems similar in both groups.

Linear regression model (LRM)


First, we try a simple linear regression model (LRM) on the BPRS data with bprs as the response and week and treatment as explanatory variables. Here we ignore the repeated-measures structure of the data and assume that all observations are independent, even though the same subjects are measured multiple times and repeated measurements on the same subject tend to be correlated.

The form of the LRM with two explanatory variables (week (time) \(T\) and treatment \(X\)) is \[ Y \sim \beta_0 + \beta_1 T + \beta_2 X + \epsilon \] Here the residual term (the difference between an observed and a predicted value, i.e. the model error) is a normally distributed random variable \(\epsilon\) with zero mean and variance \(\sigma^2\).

# create a regression model BPRS_reg
BPRS_LRM <- lm(bprs ~ week + treatment, data = BPRSL)

# print out a summary of the model
summary(BPRS_LRM)
## 
## Call:
## lm(formula = bprs ~ week + treatment, data = BPRSL)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.454  -8.965  -3.196   7.002  50.244 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  46.4539     1.3670  33.982   <2e-16 ***
## week         -2.2704     0.2524  -8.995   <2e-16 ***
## treatment2    0.5722     1.3034   0.439    0.661    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared:  0.1851, Adjusted R-squared:  0.1806 
## F-statistic: 40.55 on 2 and 357 DF,  p-value: < 2.2e-16
plot(BPRS_LRM, 1)

Commentary:

  • The LRM fit doesn’t seem that good; in particular the R-squared is quite low, so the model doesn’t capture the variation in the data very well (compare with the similar RATS analysis above, for example).
  • Still, the time variable is a significant predictor with a clear effect (mean bprs decreases substantially during the study), as is the intercept, which can be seen as the baseline level (the constant of the linear regression model).
  • The treatment variable, on the other hand, does not seem significant in the LRM.
  • From the residual plot we can see that the residuals grow with the fitted values, indicating non-linearity and suggesting that this model may not be suitable.

Random intercept model (RIM)


The previous model assumes independence of the repeated measures of bprs, and this assumption is highly unlikely to hold. So now we try something more suitable for a repeated-measures study.

To begin the more formal analysis of the BPRS data, we first fit the random intercept model (RIM) with the same two explanatory variables: week and treatment. Fitting a random intercept model allows the linear regression fit for each subject to differ in intercept from the other subjects, so the model can accommodate different symptom-development profiles.

The form of the RIM for subject \(i\) at time \(t_j\), where \(j\) indexes the week, is \[ y_{ij} = (\beta_0 + u_i) + \beta_1 t_j + \beta_2 x_i + \epsilon_{ij} \] Here the variance of each repeated measurement is \({Var}(y_{ij}) = {Var}(u_i + \epsilon_{ij}) = \sigma_u^2+\sigma^2\) (independent of time), where the random terms are

  • \(\epsilon_{ij} \sim N(0,\sigma^2)\), the usual residual error terms,
  • \(u_i \sim N(0,\sigma_u^2)\), the random effect specific to subject \(i\).

The RIM thus splits the single error term of the LRM into a subject-specific random effect and the usual residual error.

So let’s fit this model next:

# Create a random intercept model
BPRS_RIM <- lmer(bprs ~ week + treatment + (1 | subject), data = BPRSL, REML = FALSE)

# Print the summary of the model
summary(BPRS_RIM)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2582.9   2602.3  -1286.5   2572.9      355 
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -2.27506 -0.59909 -0.06104  0.44226  3.15835 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  subject  (Intercept) 97.39    9.869   
##  Residual             54.23    7.364   
## Number of obs: 360, groups:  subject, 40
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     2.3521  19.750
## week         -2.2704     0.1503 -15.104
## treatment2    0.5722     3.2159   0.178
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.256       
## treatment2 -0.684  0.000

Notes:

  • The random-effect variances of the RIM are not dramatically smaller than the LRM residual variance: the total is about 152 (97.39 + 54.23) compared with about 153 (12.37 squared) for the LRM, so the overall accuracy is not amazing.
  • The RIM did not critically change the significance of the variables, so treatment does not matter in this analysis either.
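The first note can be checked directly from the printed summaries; a quick back-of-the-envelope calculation in base R, with the numbers copied from the output above:

```r
# LRM: residual standard error 12.37, so residual variance is its square
var_lrm <- 12.37^2        # about 153

# RIM: the variance splits into a subject (intercept) part and a residual part
var_rim <- 97.39 + 54.23  # about 152

# The totals are almost identical: the RIM reallocates variance to subjects
# rather than removing it, but the within-subject residual SD drops to 7.36.
c(LRM = var_lrm, RIM = var_rim)
```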

Random intercept and random slope model (SLO)


Now we can move on to fit the random intercept and random slope model (SLO) to the data. Fitting a random intercept and random slope model allows the linear regression fits for each individual to differ in intercept but also in slope. This way it is possible to account for the individual differences in the bprs development profiles, but also the effect of time.

The form of the SLO for subject \(i\) at time \(t_j\), where \(j\) indexes the week, is \[ y_{ij} = (\beta_0 + u_i) + (\beta_1+v_i) t_j + \beta_2 x_i + \epsilon_{ij} \] Here the variance of each repeated measurement is \[ {Var}(y_{ij}) = {Var}(u_i + v_it_j +\epsilon_{ij}) = \sigma_u^2+2\sigma_{uv}t_j+\sigma_v^2t_j^2+\sigma^2, \] where the random terms are

  • \(\epsilon_{ij} \sim N(0,\sigma^2)\), the usual residual error terms,
  • \(u_i \sim N(0,\sigma_u^2)\), the random intercept effect specific to subject \(i\),
  • \(v_i \sim N(0,\sigma_v^2)\), the random slope effect, which allows the linear regression fit for each individual to differ in slope. The \(v_i\) are allowed to be correlated with the random intercept effects \(u_i\).

Let’s note here that this model allows correlation between residuals (at different time points) of the same individual.

# create a random intercept and random slope model
BPRS_SLO <- lmer(bprs ~ week + treatment + (week | subject), data = BPRSL, REML = FALSE)

# print a summary of the model
summary(BPRS_SLO)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2523.2   2550.4  -1254.6   2509.2      353 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.4655 -0.5150 -0.0920  0.4347  3.7353 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 167.827  12.955        
##           week          2.331   1.527   -0.67
##  Residual              36.747   6.062        
## Number of obs: 360, groups:  subject, 40
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  45.9830     2.6470  17.372
## week         -2.2704     0.2713  -8.370
## treatment2    1.5139     3.1392   0.482
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.545       
## treatment2 -0.593  0.000

Comments:

  • The log-likelihood is somewhat better for the SLO than for the RIM.
  • The random-effect structure shifts: the intercept variance is larger for the SLO than for the RIM, the new slope term has only a small variance, and the residual variance drops clearly (from about 54 to about 37).
  • The fixed-effect estimates change only minimally from the RIM.
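The time-dependent variance formula given above for the SLO can be illustrated with the estimates printed in the summary (intercept variance 167.827, slope variance 2.331, residual 36.747, intercept–slope correlation −0.67). Note that the printed correlation is rounded, so the numbers below are only approximate:

```r
# Variance components from the SLO summary above (rounded as printed)
s2_u <- 167.827                            # random intercept variance
s2_v <- 2.331                              # random slope variance
s2_e <- 36.747                             # residual variance
s_uv <- -0.67 * sqrt(s2_u) * sqrt(s2_v)    # covariance from the printed correlation

# Var(y_ij) = s2_u + 2*s_uv*t + s2_v*t^2 + s2_e as a function of the week t
var_t <- function(t) s2_u + 2 * s_uv * t + s2_v * t^2 + s2_e

round(var_t(0:8), 1)  # the within-subject variance first shrinks, then grows again
```

The negative intercept–slope covariance makes the variance dip in the middle of the follow-up before the quadratic term takes over, which is exactly the non-constant variance the RIM cannot express.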

Random intercept and random slope model with interaction (INT)


Finally, we can fit a random intercept and slope model that allows for a treatment × time interaction.

The form of INT for the subject \(i\) and at the time \(t_j\) where \(j\) represent the week \[ y_{ij} = (\beta_0 + u_i) + (\beta_1+v_i) t_j + \beta_2 x_i + \beta_3 x_it_j + \epsilon_{ij}. \]

# create a random intercept and random slope model with week x treatment interaction
BPRS_INT <- lmer(bprs ~ week * treatment + (week | subject), data = BPRSL, REML = FALSE)

# print a summary of the model
summary(BPRS_INT)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week * treatment + (week | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2523.5   2554.5  -1253.7   2507.5      352 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.4747 -0.5256 -0.0866  0.4435  3.7884 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 164.204  12.814        
##           week          2.203   1.484   -0.66
##  Residual              36.748   6.062        
## Number of obs: 360, groups:  subject, 40
## 
## Fixed effects:
##                 Estimate Std. Error t value
## (Intercept)      47.8856     2.9840  16.047
## week             -2.6283     0.3752  -7.006
## treatment2       -2.2911     4.2200  -0.543
## week:treatment2   0.7158     0.5306   1.349
## 
## Correlation of Fixed Effects:
##             (Intr) week   trtmn2
## week        -0.668              
## treatment2  -0.707  0.473       
## wek:trtmnt2  0.473 -0.707 -0.668

Remarks:

  • The log-likelihood is only ever so slightly better for the INT than for the SLO; it is a very close call.
  • The random-effect variances are just a bit lower for the INT than for the SLO.
  • Among the fixed effects the new week × treatment2 interaction term has some effect on the model. Its t value is modest, hinting only very weakly that the two treatments might have somewhat different slopes.
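Since INT is SLO plus one extra fixed-effect parameter, the two models can be compared with a likelihood ratio test; this is what `anova(BPRS_SLO, BPRS_INT)` would report. A hand computation from the printed deviances (2509.2 and 2507.5):

```r
# Deviances from the two model summaries above
dev_slo <- 2509.2
dev_int <- 2507.5

# Likelihood ratio statistic, chi-squared with 1 df (one extra parameter)
lr <- dev_slo - dev_int                       # 1.7
p  <- pchisq(lr, df = 1, lower.tail = FALSE)  # ~0.19, so no clear improvement

round(p, 2)
```

The p-value around 0.19 supports the remark above: the interaction term does not improve the model significantly.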

Next let’s draw plots for different model fits to compare.

# Create a vector of the fitted values
fitted_RIM <- fitted(BPRS_RIM)

# Create a new column fitted to BPRSL
BPRSL <- BPRSL %>%
  mutate(fitted_RIM)



ggplot(BPRSL, aes(x = week, y = bprs, group = subject)) +
  geom_line(aes(linetype = treatment)) + ggtitle("Original observations") +
  theme(legend.position = "top")

# Draw the plot

ggplot(BPRSL, aes(x = week, y = fitted_RIM, group = subject)) +
  geom_line(aes(linetype = treatment)) + ggtitle("Random intercept fit") +
  theme(legend.position = "top")

# Create a vector of the fitted values
fitted_SLO <- fitted(BPRS_SLO)

# Create a new column fitted to BPRSL
BPRSL <- BPRSL %>%
  mutate(fitted_SLO)

# draw the plot

ggplot(BPRSL, aes(x = week, y = fitted_SLO, group = subject)) +
  geom_line(aes(linetype = treatment)) + ggtitle("Random intercept + slope fit") +
  theme(legend.position = "top")

# Create a vector of the fitted values
fitted_INT <- fitted(BPRS_INT)

# Create a new column fitted to BPRSL
BPRSL <- BPRSL %>%
  mutate(fitted_INT)

# draw the plot

ggplot(BPRSL, aes(x = week, y = fitted_INT, group = subject)) +
  geom_line(aes(linetype = treatment)) + ggtitle("Random intercept + slope + interaction fit") +
  theme(legend.position = "top")

The random intercept and random slope fits, with and without the interaction, are notably different from the plain random intercept fit. The random slope versions are perhaps better at capturing the subjects whose symptoms are not decreasing.

It is quite hard to compare the fits with the actual data because the data are so variable; a linear model looks hard to fit to the original observations. The individual slopes seem to capture the effect that subjects with stronger symptoms at the start have more “space” to recover and get better than subjects who are already doing better at the start.
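One more way to read the INT fit: its fixed effects imply an average bprs slope per treatment group. Using the coefficients printed above (week −2.6283, week:treatment2 0.7158) we can sketch this in base R; with the fitted model object the same numbers would come from `fixef(BPRS_INT)`:

```r
# Fixed effects from the INT summary above
b_week     <- -2.6283   # average weekly change in bprs, treatment 1
b_interact <-  0.7158   # additional weekly change for treatment 2

slope_t1 <- b_week               # treatment 1 slope
slope_t2 <- b_week + b_interact  # treatment 2 slope

c(treatment1 = slope_t1, treatment2 = slope_t2)
```

So on average bprs falls a bit faster in treatment group 1, but as the interaction term was not significant, this difference should not be over-interpreted.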


Comments on the exercises and learning


This week was really tough to manage and to find enough time for the exercises, but I managed to squeeze them in on time. I think I learned quite a bit about these models precisely because it was a struggle; with better R skills it would have been a lot easier. The interpretations were really challenging this week, and linear mixed effects models are not as easy to understand as they first seem.

It took me a long time to figure out that in the BPRS data the subject identifiers are not unique across treatment groups, so each subject number actually refers to two different persons. I figured this out only at the end. Despite the challenges, I still had fun.